Maximizing Performance: Strategies for Code Optimization

Valeria Duran

What Is Optimization?

An act, process, or methodology of making something (such as a design, system, or decision) as fully perfect, functional, or effective as possible. – Merriam-Webster

Code optimization is the process of enhancing code quality and efficiency.


Pre-Optimization Steps

  • Make the code work.
  • Don’t repeat yourself.
  • Write clean code and document it well.
  • Don’t try to reinvent the wheel.

“Make it work, then make it beautiful, then if you really, really have to, make it fast. 90 percent of the time, if you make it beautiful, it will already be fast. So really, just make it beautiful!” –Joe Armstrong

Optimization should be the final step of your programming practice. Following the pre-optimization steps often yields good performance on its own. Only optimize when it's necessary!

Steps to Optimize Code

  1. Make sure the code runs as expected, is clean, is well documented, and is created with the end user in mind.
  2. Profile the code (identify where the slow code is).
  3. Explore other solutions:
    1. Vectorize code when possible.
    2. Use parallelization techniques.
    3. Cache frequently used data.
    4. Manage memory.
    5. Find a faster package/function.
  4. Benchmark the code (compare candidate solutions against your current code).
  5. Execute!
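
As a minimal sketch of steps 2 through 4, the snippet below benchmarks a hand-written loop against R's built-in vectorized sum(); the data here is made up purely for illustration:

library(microbenchmark)

x <- runif(1e5)

# An explicit loop standing in for un-vectorized code
sum_loop <- function(v) {
  total <- 0
  for (i in seq_along(v)) total <- total + v[i]
  total
}

# Step 4: benchmark the candidate solutions against each other
microbenchmark(
  loop       = sum_loop(x),
  vectorized = sum(x),
  times = 100
)

On most machines the vectorized call wins by a wide margin, which is exactly the kind of evidence a benchmark should produce before you commit to a rewrite.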

Scenario 1: Working With Big Data!

Goal: build a model using the insurance claims of ~25 million lives with two years’ worth of data (i.e., billions of records!) to estimate the cost of a procedure.

Steps:

  1. Partition data into categories/sections (body region).
  2. Create R scripts with tidyverse on a small dataset (Houston, Texas population for 1 year).
  3. Apply same scripts to a larger dataset (Texas population for 1 year).
  4. Update scripts to use data.table instead of tidyverse.
  5. Increase the size of the Windows instance (from 16 GB to 32 GB of RAM).
  6. Run scripts on body regions (nationwide, and on two years of data).

Took months to complete…

…Other solutions are also possible!
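
Step 4 above (converting tidyverse scripts to data.table) might look roughly like this sketch; the claims table and its columns are invented for illustration and are not the actual project data:

library(dplyr)
library(data.table)

# Hypothetical claims data; the real data had billions of rows
claims <- data.frame(
  procedure = sample(c("knee", "hip", "spine"), 1e6, replace = TRUE),
  cost      = rlnorm(1e6, meanlog = 8)
)

# tidyverse version (the original scripts)
claims %>%
  group_by(procedure) %>%
  summarise(mean_cost = mean(cost))

# data.table version (the updated scripts): same result, less overhead
claims_dt <- as.data.table(claims)
claims_dt[, .(mean_cost = mean(cost)), by = procedure]

On large tables, data.table's by-reference grouping typically runs faster and uses less memory than the equivalent grouped summarise.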

Scenario 2: End User in Mind

# RStudio Server: 420 ms

# RStudio Desktop below

library(profvis)
library(git2r)

profvis({
  git_object <- function(data_object = NULL) {
    # All .rda object names tracked in the git object database
    object_names <- sort(unique(subset(odb_blobs(), grepl(".rda", name))$name))

    # Keep only the names matching the search term
    git_obj <- grep(data_object, object_names, value = TRUE, ignore.case = TRUE)

    return(git_obj)
  }

  git_object("pkg")
})
library(microbenchmark)

microbenchmark(
  git_object  <- sort(unique(subset(odb_blobs(), grepl(".rda", name))$name)),
  git_object2 <- sort(unique(gsub(".*/", "", system2("git", "ls-files *.rda", stdout = TRUE)))),
  times = 10
)
Unit: seconds
                                                                                               expr
                          git_object <- sort(unique(subset(odb_blobs(), grepl(".rda", name))$name))
 git_object2 <- sort(unique(gsub(".*/", "", system2("git", "ls-files *.rda", stdout = TRUE))))
        min         lq       mean     median         uq        max neval cld
 114.850255 116.144924 129.619974 117.037402 118.206581 190.981746    10  a 
   1.018575   1.085838   1.806733   2.169441   2.265821   2.535425    10   b

… using base::system2() produces much faster results than git2r::odb_blobs(). The newer tool isn't always the better option!

Useful R Tools

  • Profiling packages: {profvis} and {profile}.
  • Benchmarking packages: {microbenchmark} and {bench}.
  • Caching packages: {memoise} (non-persistent by default) and {R.cache} (persistent).
  • Parallel computing packages: {snow} and {parallel}.
  • Out-of-memory data packages: {ff}, {bigmemory}, and {feather}.
  • Use gc() to release memory (rarely needed, since R garbage-collects automatically, but it doesn't hurt after removing large objects).
  • Use {data.table} for faster computations.
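
As a small sketch of caching with {memoise}, the slow_square() function below is a stand-in for a genuinely expensive computation:

library(memoise)

# A deliberately slow function simulating expensive work
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

fast_square <- memoise(slow_square)

system.time(fast_square(4))  # ~1 second: computed, then cached
system.time(fast_square(4))  # near-instant: served from the cache

Because {memoise} caches in memory by default, the cache is lost when the R session ends; {R.cache} is the persistent alternative listed above.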

Can Optimizing Code Be a Bad Thing?

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”
– Donald Knuth, Structured Programming with go to Statements

When Is Optimizing Code Bad?

  • When code becomes less readable.
  • When performance improvement is minuscule.
  • When the time needed to optimize exceeds the time the task itself takes.
  • When code is not used frequently enough.


Conclusion

  • When writing code, consider the end user. What the majority will use might not be what you use. This will save you a lot of future rework.
  • Trustworthy code is oftentimes better than fast code.
  • Don’t optimize unless you absolutely must.
  • Consider the biggest trade-offs: time and effort. Is it worth it?

Resources